Motivation

Some facts about heart disease

  • Approximately 1.5 million heart attacks and strokes occur every year in the United States.
  • More than 800,000 people in the United States die from cardiovascular disease each year—that’s 1 in every 3 deaths, and about 160,000 of them occur in people under age 65.
  • Heart disease kills roughly the same number people in the United States each year as cancer, pneumonia, and accidents combined.
  • Heart disease and stroke cost the nation an estimated $555 billion in health care costs and lost productivity in 2015.

Here’s more:

Majority of Americans are already suffering from heart disease: Prevalence of heart disease

And…

About 90% of the risk for the first heart attack is because of conditions or lifestyle factors that are preventable

Lifestyle risk factors

Lifestyle risk factors

There is plenty of room for improvement when it comes to raising awareness for cardiovascular risk, based on a recent study that found one in five adults at risk for heart disease don’t recognize a need to improve their health. We believe that a better understanding of risk factors underlying health perceptions and behaviors is needed to capitalize on cardiovascular preventive efforts.

Goals of our project

  • Understanding risk factors for heart disease and stroke
  • Visualizing correlations of these risk factors with heart disease and stroke
  • Visualizing the distribution of risk factors and cardiovascular outcomes in New York City
  • Predicting risk of heart disease in an individual patient based on presence of risk factors

Initial questions

We sought to examine the risk factors for heart disease and their prevalence in New York City. We were also interested in examining novel ways of visualizing the correlation of indivisual risk factors with heart disease and stroke based on our knowledge gained in the Data Science class. In addition, we also wanted to find ways of predicting and visualizing an individual’s risk for heart disease. As such, our questions were as follows:

  • What are the risk factors for heart disease and stroke? How can we visualize the correlation of individual risk factors with heart disease and stroke?
  • What is the distribution of these risk factors and outcomes in New York city?
  • How can we predict and visualize an individual’s risk for heart disease?

Data

In order to address the first two questions, we utilized the 500 Cities: Local Data for Better Health dataset. The 500 Cities project is a collaboration between CDC, the Robert Wood Johnson Foundation, and the CDC Foundation. The purpose of the 500 Cities Project is to provide city- and census tract-level small area estimates for chronic disease risk factors, health outcomes, and clinical preventive service use for the largest 500 cities in the United States. These small area estimates will allow cities and local health departments to better understand the burden and geographic distribution of health-related variables in their jurisdictions, and assist them in planning public health interventions. Since we were interested in visualizing these data only for New York city, we filtered the dataset to include data only from New York city.

Risk factors that we’re interested in:: (15 total)

  • ACCESS2: Current lack of health insurance among adults aged 18<89>64 Years
  • BINGE: Binge drinking among adults aged >=18 Year
  • BPHIGH: High blood pressure among adults aged >=18 Year
  • BPMED: Taking medicine for high blood pressure control among adults aged >=18 Years with high blood pressure
  • OBESITY: Obesity among adults aged >=18 Years
  • CHECKUP: Visits to doctor for routine checkup within the past Year among adults aged >=18 Years
  • CHOLSCREEN: Cholesterol screening among adults aged >=18 Years
  • CSMOKING: Current smoking among adults aged >=18 Years
  • DIABETES: Diagnosed diabetes among adults aged >=18 Years
  • HIGHCHOL: High cholesterol among adults aged >=18 Years who have been screened in the past 5 Years
  • KIDNEY: Chronic kidney disease among adults aged >=18 Years
  • LPA: No leisure-time physical activity among adults aged >=18 Years
  • MHLTH: Mental health not good for >=14 days among adults aged >=18 Years
  • PHLTH: Physical health not good for >=14 days among adults aged >=18 Years
  • SLEEP: Sleeping less than 7 hours among adults aged >=18 Years

Outcomes that we’re interested in:: (2 total)

  • CHD: Coronary heart disease among adults aged >=18 Years
  • STROKE: Stroke among adults aged >=18 Years

In order to predict risk for heart disease for an individual, we utilized the Framigham Risk Score. The Framingham Risk Score is a gender-specific algorithm used to estimate the 10-year cardiovascular risk of an individual. The Framingham Risk Score was first developed based on data obtained from the Framingham Heart Study, to estimate the 10-year risk of developing coronary heart disease. We utilized the algorithm to develop visualization in ShinyApp as discussed below.

Exploratory Analysis

Loading and tidying the data

We scraped the data directly from the web as seen in the code below and called it cvrisk.

cvrisk_url = "https://data.cdc.gov/api/views/6vp6-wxuq/rows.csv?accessType=DOWNLOAD"

## Read in data from github
cvrisk = 
  read.csv(url(cvrisk_url)) %>% 
  janitor::clean_names() %>% 
  as_tibble()

We then restricted our database to New York City and included only the variables that were useful for further data visualization. We called this database nyc_cvrisk.

nyc_cvrisk = 
  cvrisk %>% 
  filter(state_desc == "New York", city_name == "New York",
         geographic_level == "Census Tract",
         !is.na(data_value),
         #year == 2016,
         measure_id %in% c("ACCESS2", "BINGE", "BPHIGH", "BPMED", "OBESITY", "CHECKUP", 
                         "CHOLSCREEN", "CSMOKING", "DIABETES", "HIGHCHOL", "KIDNEY", 
                         "LPA", "MHLTH", "PHLTH", "SLEEP", "CHD", "STROKE")) %>% 
  separate(geo_location, into = c("latitude", "longitude"), sep = ",") %>% 
  mutate(latitude = str_replace(latitude, "\\(", ""),
         longitude = str_replace(longitude, "\\)", "")) %>% 
  select(unique_id, latitude, longitude, population_count, measure_id, data_value, short_question_text)

Making the animated scatterplot

We first created another dataset that can be used for the animated scatterplot such that the outcome measure for coronary artery disease appears as a separate variable in the dataset.

nyc_cvrisk_limited = nyc_cvrisk %>% 
  filter(short_question_text %in% c("Annual Checkup", "Binge Drinking", "Cholesterol Screening", "Chronic Kidney Disease", "Current Smoking", "Diabetes", "Health Insurance", "High Blood Pressure", "High Cholesterol", "Mental Health", "Obesity", "Physical Health", "Physical Inactivity", "Sleep <7 hours", "Taking BP Medication"))

nyc_cad = nyc_cvrisk %>% 
  filter(short_question_text == "Coronary Heart Disease")

nyc_cvrisk_joined = left_join(nyc_cvrisk_limited, nyc_cad, by = "unique_id")

We then installed the package required for animation

devtools::install_github("dgrtwo/gganimate")

We then created an animated scatterplot with Coronary Heart DIsease prevalence on the X-axis and different risk factors on the Y-axis

theme_set(theme_bw())  # pre-set the bw theme.


## using transition_length and state_length
library(ggplot2)
library(gganimate)
p = nyc_cvrisk_joined %>%  
  ggplot(aes(x = data_value.y, y = data_value.x, frame = short_question_text.x)) +
  geom_point(aes(size = population_count.x, colour = short_question_text.x ),
             alpha = 0.5) + 
  xlim(0, 10) +
  
  labs(title = "{closest_state}",
       x = 'Coronary Heart Disease Prevalence', 
       y = 'Risk Factor Prevalence') +
  
  theme(plot.title = element_text(size = 40, face = "bold"),
        axis.text=element_text(size=18),
        axis.title=element_text(size=18,face="bold")) + 
  theme(legend.text=element_text(size=16)) +
  
  # gganimate parts
  transition_states(short_question_text.x, transition_length = 1, state_length = 50000, wrap = TRUE) +
  enter_fade() +
  exit_fade()

animate(p, fps = 1, height = 800, width = 800)

Making the correlation plot

To make the correlation plot, we first transformed our dataset in the wide format and called it nyc_cvrisk_wide.

nyc_cvrisk_wide = 
  nyc_cvrisk %>% 
  select(-measure_id)%>%
  spread(key = short_question_text, value = data_value) %>% 
  janitor::clean_names()

The correlation plot for coronary heart disease and risk factors of interest is as follows:

nyc_cvrisk_wide %>% 
  select(annual_checkup:sleep_7_hours) %>% 
  select("coronary_heart_disease", everything())%>%
  rename("CHD" = coronary_heart_disease,
         "Annual Checkup" = annual_checkup,
         "Binge Drinking" = binge_drinking,
         "Kidney Disease" = chronic_kidney_disease,
         "Current Smoking" = current_smoking,
         "Diabetes" = diabetes,
         "No Insurance" = health_insurance,
         "Poor Mental Health" = mental_health,
         "Obesity" = obesity,
         "Poor Health" = physical_health,
         "Physical Inactivity" = physical_inactivity,
         "Poor Sleep" = sleep_7_hours)%>%
  cor() %>% 
  corrplot(., method="circle")

The correlation plot for stroke and risk factors of interest is as follows:

nyc_cvrisk_wide %>% 
  select(annual_checkup:stroke)%>%
  select(stroke, everything()) %>% 
  select(-c(coronary_heart_disease))%>%
  rename("Stroke" = stroke,
         "Annual Checkup" = annual_checkup,
         "Binge Drinking" = binge_drinking,
         "Kidney Disease" = chronic_kidney_disease,
         "Current Smoking" = current_smoking,
         "Diabetes" = diabetes,
         "No Insurance" = health_insurance,
         "Poor Mental Health" = mental_health,
         "Obesity" = obesity,
         "Poor Health" = physical_health,
         "Physical Inactivity" = physical_inactivity,
         "Poor Sleep" = sleep_7_hours)%>%
  cor() %>% 
  corrplot(., method="circle")

Additional Analysis

We used the Shiny Application and leaflet package to perform visualization of geographic distribution of several risk factors and outcomes in New York City.

We also used Shiny Application to build a program to calculate individual risk of heart disease using the Framingham Risk Score

Discussion